Contents

1 Background

In this vignette we assess the performance of MSqRob for differential expression analysis using a publicly available spikein study (PRIDE identifier: PXD003881 Shen et al. [2018]). E. Coli lysates were spiked at five different concentrations (3%, 4.5%, 6%, 7.5% and 9% wt/wt) in a stable human background (four replicates per treatment). The samples were run on an Orbitrap Fusion mass spectrometer. Raw data files were processed with MaxQuant (version 1.6.1.0, Cox and Mann [2008]) using default search settings unless otherwise noted. Spectra were searched against the UniProtKB/SwissProt human and E. Coli reference proteome databases (07/06/2018), concatenated with the default Maxquant contaminant database. Carbamidomethylation of Cystein was set as a fixed modification, and oxidation of Methionine and acetylation of the protein amino-terminus were allowed as variable modifications. In silico cleavage was set to use trypsin/P, allowing two miscleavages. Match between runs was also enabled using default settings. The resulting peptide-to-spectrum matches (PSMs) were filtered by MaxQuant at 1% FDR.

concentrations <- c(a=3,b=4.5,c=6,d=7.5,e=9)

2 Data

We first import the peptides.txt file. This is the file that contains your peptide-level intensities. For a MaxQuant search [6], this peptides.txt file can be found by default in the “path_to_raw_files/combined/txt/” folder from the MaxQuant output, with “path_to_raw_files” the folder where raw files were saved. In this tutorial, we will use a MaxQuant peptides file of the ecoli spike in study that is stored in the msdata package. We will use the Features package to import the data.

With the grepEcols function we find the columns that are containing the expression data of the peptides in the peptides.txt file.

library(tidyverse)
library(limma)
library(Features)
library(msqrob2)

myurl<- "https://www.dropbox.com/s/62hhol96pz0sjbi/peptides.zip?dl=0"
download.file(myurl,"peptides.zip",method="curl",extra="-L")
unzip("peptides.zip")
peptidesFile <- "peptides.txt"
ecols <- MSnbase::grepEcols(peptidesFile, "Intensity ", split = "\t")
pe <- readFeatures(table = peptidesFile, fnames = 1, ecol = ecols,
                   name = "peptideRaw", sep="\t")
pe
## A instance of class Features containing
##  [1] peptideRaw: SummarizedExperiment with 32827 rows and 20 columns

We can extract the spikein condition from the raw file name.

cond <- which(strsplit(colnames(pe)[[1]][1], split = "")[[1]] == "a") # find where condition is stored
colData(pe)$condition <- substr(colnames(pe), cond, cond) %>% unlist %>%  as.factor

We calculate how many non zero intensities we have per peptide. This will be useful for filtering.

rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)

Peptides with zero intensities are missing peptides and should be represent with a NA value instead of 0.

pe <- zeroIsNA(pe,"peptideRaw")

2.1 Information on species

In the spikin-study there are peptides from ecoli and human proteins. The ecoli peptides are spiked.

myurl<- "https://www.dropbox.com/s/swhu9nqwktythtb/ecoli_up000000625_7_06_2018.fasta?dl=0"
download.file(myurl,"ecoli.fasta",method="curl",extra="-L")
myurl<-"https://www.dropbox.com/s/n61n28wrcpwsb4a/human_up000005640_sp_7_06_2018.fasta?dl=0"
download.file(myurl,"human.fasta",method="curl",extra="-L")
id <- list(ecoli = 'ecoli.fasta',
          human = 'human.fasta') %>%
  map(~{read_lines(.x) %>%
          {.[str_detect(.,'^>')]} %>%
          str_extract(.,'(?<=\\|).*(?=\\|)')})

2.2 Data exploration

We can inspect the missingness in our data with the plotNA() function provided with MSnbase. 23% of all peptide intensities are missing and for some peptides we don’t even measure a signal in any sample. The missingness is similar across samples. Note, that we plot the peptide data, so the label protein in the plot refers to peptides.

MSnbase::plotNA(assay(pe))

3 Preprocessing

We will normalize the data using vsn normalisation. Note, that the data should not be log-transformed then.

3.1 Filtering

3.1.1 Handling overlapping protein groups

In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.

pe[["peptideRaw"]]<-pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$Proteins %in% smallestUniqueGroups(rowData(pe[["peptideRaw"]])$Proteins),]

3.1.2 Remove reverse sequences (decoys) and contaminants

We now remove the contaminants, peptides that map to decoy sequences and proteins, which were only identified by peptides with modifications.

pe[["peptideRaw"]] <- pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$Reverse!= "+", ]
pe[["peptideRaw"]] <- pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$
Potential.contaminant!="+", ]

3.1.3 Drop peptides that were only identified in one sample

We want to keep peptide that were at least observed twice.

pe[["peptideRaw"]] <- pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$nNonZero >= 2, ]
nrow(pe[["peptideRaw"]])
## [1] 28944

We keep 28944 peptides upon filtering.

3.2 Normalize the data using the vsn method

pe <- normalize(pe,i="peptideRaw",method="vsn",name="peptideNorm")
## vsn2: 28944 x 20 matrix (1 stratum).
## Please use 'meanSdPlot' to verify the fit.

3.3 Explore vsn normalized data

Upon normalisation the density curves for all samples coincide.

limma::plotDensities(assay(pe[["peptideNorm"]]))

We can visualize our data using a Multi Dimensional Scaling plot, eg. as provided by the limma package.

limma::plotMDS(assay(pe[["peptideNorm"]]), col = as.numeric(colData(pe)$condition))

The first axis in the plot is showing the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences (log FC) in the peptide data seems to be driven by the spike-in condition.

3.4 Summarization to protein level

Use the standard summarisation in aggregateFeatures ( robust summarisation).

pe <- aggregateFeatures(pe, i="peptideNorm", fcol = "Proteins", na.rm = TRUE, name="protein")
## Your quantitative and row data contain missing values. Please read
## the relevant section(s) in the aggregateFeatures manual page
## regarding the effects of missing values on data aggregation.
plotMDS(assay(pe[["protein"]]),col = as.numeric(colData(pe)$condition))

4 Data Analysis

4.1 Estimation

We model the protein level expression values using msqrob. By default msqrob2 estimates the model parameters using robust regression.

pe <- msqrob(object=pe,i="protein", formula=~condition)

4.2 Inference

What are the parameter names of the model?

getCoef(rowData(pe[["protein"]])$msqrobModels[[1]])
## (Intercept)  conditionb  conditionc  conditiond  conditione 
## 23.44881438  0.08851106  0.03863228  0.01051063 -0.07124313

Spike-in condition a is the reference class. So the mean log2 expression for samples from condition a is ‘(Intercept). The mean log2 expression for samples from condition b-e is’(Intercept)+conditionb’,…,‘(Intercept)+conditione’, respectively. Hence the average log2 fold change (FC) between condition b and condition a is modelled using the parameter ‘conditionb’. Hence we will assess the contrast ‘conditionb = 0’ with our statistical test. The same holds for comparison c-a, d-a, e-a.

comparisonsRef<- paste0(paste0("condition",letters[2:5])," = 0")
comparisonsRef
## [1] "conditionb = 0" "conditionc = 0" "conditiond = 0" "conditione = 0"

The test for average log2 FC between condition c and b will assess the contrast ‘conditionc - conditionb = 0’, …

comparisonsOther<-paste0(
    apply(
          combn(paste0("condition",letters[2:5]),2)[2:1,],
          2,
          paste,
          collapse=" - ")
          ," = 0")
comparisonsOther
## [1] "conditionc - conditionb = 0" "conditiond - conditionb = 0"
## [3] "conditione - conditionb = 0" "conditiond - conditionc = 0"
## [5] "conditione - conditionc = 0" "conditione - conditiond = 0"
comparisons<-c(comparisonsRef,comparisonsOther)

We make the contrast matrix using the makeContrast function

L <- makeContrast(comparisons,parameterNames=paste0("condition",letters[2:5]))
L
##            conditionb conditionc conditiond conditione
## conditionb          1          0          0          0
## conditionc          0          1          0          0
## conditiond          0          0          1          0
## conditione          0          0          0          1
##            conditionc - conditionb conditiond - conditionb
## conditionb                      -1                      -1
## conditionc                       1                       0
## conditiond                       0                       1
## conditione                       0                       0
##            conditione - conditionb conditiond - conditionc
## conditionb                      -1                       0
## conditionc                       0                      -1
## conditiond                       0                       1
## conditione                       1                       0
##            conditione - conditionc conditione - conditiond
## conditionb                       0                       0
## conditionc                      -1                       0
## conditiond                       0                      -1
## conditione                       1                       1

And we adopt the hypothesis tests for each contrast.

pe <- hypothesisTest(object=pe,i="protein",contrast=L)

4.3 Top tables for contrasts

Here, we show the 6 most DE proteins for the comparison b-a.

rowData(pe[["protein"]]) %>%
          .$"conditionb" %>%
          rownames_to_column("protein") %>%
          arrange(pval) %>%
          column_to_rownames("protein") %>%
          head
##            logFC         se       df        t         pval      adjPval
## P0A799 0.6536363 0.03617586 16.82604 18.06830 1.878796e-12 7.186060e-09
## P0A853 0.5943575 0.03263387 16.19385 18.21290 3.264165e-12 7.186060e-09
## P0A7J3 0.6367692 0.03710347 16.49710 17.16199 5.984403e-12 8.783109e-09
## P0A7S9 0.6940529 0.04175391 15.93352 16.62247 1.735147e-11 1.909963e-08
## P0A6M8 0.6141933 0.03858488 16.18665 15.91798 2.609115e-11 2.297587e-08
## P0C0L2 0.6438649 0.04024479 15.57908 15.99871 4.374996e-11 3.210518e-08

We do the same for comparison c-b.

rowData(pe[["protein"]]) %>%
          .$"conditionc - conditionb" %>%
          rownames_to_column("protein") %>%
          arrange(pval) %>%
          column_to_rownames("protein") %>%
          head
##            logFC         se       df        t         pval      adjPval
## P0A799 0.4477370 0.03653723 16.82604 12.25427 8.272065e-10 1.705967e-06
## P60723 0.4860439 0.03805471 15.96275 12.77224 8.537166e-10 1.705967e-06
## P0A6F3 0.4742037 0.03685706 15.43962 12.86602 1.162367e-09 1.705967e-06
## P0A853 0.3906772 0.03334316 16.19385 11.71686 2.533679e-09 2.788947e-06
## P0A6M8 0.4407012 0.03858488 16.18665 11.42160 3.690290e-09 3.172843e-06
## P0A7J3 0.4162573 0.03737184 16.49710 11.13826 4.323656e-09 3.172843e-06

4.4 Plots

4.4.1 Condition b vs condition a

4.4.1.1 Volcano-plot

volcano <- rowData(pe[["protein"]]) %>%
    .$"conditionb"%>%
    ggplot(aes(x = logFC, y = -log10(pval),
           color = adjPval < 0.01)) +
    geom_point(cex = 2.5) +
    scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal() +
    geom_vline(xintercept=log2(concentrations["b"]/concentrations["a"]),col="red")
volcano

Note, that the majority of the significant DE proteins are upregulated. In the spike-in study all ecoli proteins are indeed present at a higher concentration in condition b than in condition a. We also observe that the DE proteins are indeed close to the spiked in fold change of 0.58. The log2 FC of the spike in proteins is indicated with the vertical read line in the plot.

4.4.1.2 Heatmap

We first select the names of the significant proteins.

sigNames <- rowData(pe[["protein"]]) %>%
    .$"conditionb" %>%
    rownames_to_column("protein") %>% filter(adjPval<0.01) %>% pull(protein)
heatmap(assay(pe[["protein"]])[sigNames, ])

Indeed, the majority of the DE proteins at 1% FDR seem to be spiked.

4.4.1.3 Detail plots

We first extract the normalized peptide expression values for a particular protein.

for (protName in sigNames[1:5])
{
pePlot <- pe[protName,,c("peptideNorm","protein")]
pePlotDf <- data.frame(longFormat(pePlot))
pePlotDf$assay <- factor(pePlotDf$assay,
                        levels = c("peptideNorm", "protein"))
pePlotDf$condition <- as.factor(colData(pePlot)[pePlotDf$colname, "condition"])
p1 <- ggplot(data = pePlotDf,
       aes(x = colname,
           y = value,
           group = rowname)) +
    geom_line() + geom_point() +  theme_minimal() +
    facet_grid(~ assay) + ggtitle(protName)
print(p1)

p2 <- ggplot(pePlotDf, aes(x = colname, y = value, fill = condition)) + geom_boxplot(outlier.shape = NA) +
geom_point(position = position_jitter(width = .1), aes(shape = rowname)) + scale_shape_manual(values = 1:nrow(pePlotDf)) +
labs(title = protName, x = "sample", y = "Peptide intensity (log2)") + theme_minimal()
facet_grid(~assay)
print(p2)
}

4.4.1.4 Sensitivity FDP plot

Because we are analysing a spike-in study we know the ground truth, i.e. we know that only the spike-in proteins (ecoli) are differentially expressed. We can therefore evaluate the performance of the method, i.e. we will assess

  • the sensitivity or true positive rate (TPR), the proportion of actual positives that are correctly identified, in the protein list that we return \[TPR=\frac{TP}{\text{#actual positives}},\] here TP are the true positives in the list. The TPR is thus the fraction of ups proteins that we can recall.

  • false discovery proportion (FPD): fraction of false positives in the protein list that we return: \[FPD=\frac{FP}{FP+TP},\] with FP the false positives. In our case the yeast proteins that are in our list.

Instead of only calculating that for the protein list that is returned for the chosen FDR level, we can do this for all possible FDR cutoffs so that we get an overview of the quality of the ranking of the proteins in the protein list.

We first add the ground truth data to the rowData of the object.

accessions <- rownames(pe[["protein"]]) %>%
    data_frame(protein=.)

accessions <- accessions %>%
    transmute(protein=as.character(protein),proteins = strsplit(protein, ';')) %>%
    unnest %>%
    mutate(human = proteins %in% id$human, ecoli =  proteins %in% id$ecoli) %>%
    group_by(protein) %>%
    summarise(human = any(human), ecoli = any(ecoli)) %>%
    right_join(accessions)
## Joining, by = "protein"
rowData(pe[["protein"]])$accession<-accessions

Check that all accessions are either human or ecoli:

nrow(accessions)
## [1] 4710
sum(accessions$human)
## [1] 3954
sum(accessions$ecoli)
## [1] 756
sum(accessions$human) + sum(accessions$ecoli)
## [1] 4710

Function to calculate TPR and FDP

tprFdp<-function(pval,tp,adjPval){
ord<-order(pval)
return(data.frame(
  pval=pval[ord],
  adjPval=adjPval[ord],
  tpr=cumsum(tp[ord])/sum(tp),
  fdp=cumsum(!tp[ord])/1:length(tp)))
}

Performance Plot

tprFdpCompBa <- tprFdp(rowData(pe[["protein"]])$"conditionb"$pval,
             rowData(pe[["protein"]])$accession$ecoli,rowData(pe[["protein"]])$"conditionb"$adjPval)
tprFdpPlotBa <- tprFdpCompBa %>%
      ggplot(aes(x=fdp,y=tpr)) +
      geom_path() +
      geom_vline(xintercept=0.01,lty=2) + geom_point(data=tprFdpCompBa[sum(tprFdpCompBa$adjPval<0.01,na.rm=TRUE),],aes(x=fdp,y=tpr),cex=2)
tprFdpPlotBa

tprFdpPlotBa + xlim(0,.25)
## Warning: Removed 4038 rows containing missing values (geom_path).

The toplist that is returned at the 1% FDR level is indeed close to 1% FDP, indicating that the FDR level is well controlled. (see dot on performance plot).

4.4.2 Other contrasts

contrastNames<-colnames(L)
FCs<-apply(
           combn(letters[1:5],2),
           2,
           function(x) concentrations[x[2]]/concentrations[x[1]]
          )
names(FCs)<-contrastNames

4.4.2.1 Volcano-plots

volcano<-list()
for (i in contrastNames)
{
volcano[[i]] <- rowData(pe[["protein"]])[[i]] %>%
    ggplot(aes(x = logFC, y = -log10(pval),
           color = adjPval < 0.01)) +
    geom_point(cex = 2.5) +
    scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal() +
    geom_vline(xintercept=log2(FCs[i]),col="red") +
    ggtitle(i)
}
volcano
## $conditionb

## 
## $conditionc

## 
## $conditiond

## 
## $conditione

## 
## $`conditionc - conditionb`

## 
## $`conditiond - conditionb`

## 
## $`conditione - conditionb`

## 
## $`conditiond - conditionc`

## 
## $`conditione - conditionc`

## 
## $`conditione - conditiond`

4.4.2.2 Heatmaps

for (i in contrastNames)
{
sigNames <- rowData(pe[["protein"]])[[i]] %>%
    rownames_to_column("protein") %>% filter(adjPval<0.01) %>% pull(protein)
heatmap(assay(pe[["protein"]])[sigNames, ],main=i)
}

We again observe that the majority of the proteins that are returned in all comparisons are spike-in proteins.

4.4.2.3 Sensitivity FDP plots

tprFdps<-list()
tprFdpPlots <- list()
for (i in contrastNames)
{
tprFdps[[i]] <- tprFdp(rowData(pe[["protein"]])[[i]]$pval,
             rowData(pe[["protein"]])$accession$ecoli,rowData(pe[["protein"]])[[i]]$adjPval)
tprFdpPlots[[i]] <- tprFdps[[i]] %>%
      ggplot(aes(x=fdp,y=tpr)) +
      geom_path() +
      geom_vline(xintercept=0.01,lty=2) + geom_point(data=tprFdps[[i]][sum(tprFdps[[i]]$adjPval<0.01,na.rm=TRUE),],aes(x=fdp,y=tpr),cex=2) +
      ggtitle(i)
}
tprFdpPlots
## $conditionb

## 
## $conditionc

## 
## $conditiond

## 
## $conditione

## 
## $`conditionc - conditionb`

## 
## $`conditiond - conditionb`

## 
## $`conditione - conditionb`

## 
## $`conditiond - conditionc`

## 
## $`conditione - conditionc`

## 
## $`conditione - conditiond`

The performance gets better when the logFC between the compared spike-in conditions increase.

Note, that the FDP is not close to the chosen FDR of 0.01 for the comparisons involving condition e. This is probably due to the competition effects because of spiking the ecoli proteins in at a relative high concentration.

5 Comparison with other workflows

5.1 Median summarisation

pe <- aggregateFeatures(pe,"peptideNorm",fcol = "Proteins", na.rm = TRUE, name="proteinMedian",fun=matrixStats::colMedians)
## Your quantitative and row data contain missing values. Please read
## the relevant section(s) in the aggregateFeatures manual page
## regarding the effects of missing values on data aggregation.
pe <- msqrob(object=pe,i="proteinMedian", formula=~condition)
pe <- hypothesisTest(object=pe,i="proteinMedian",contrast=L)

5.1.1 Volcano-plots

volcanoMed<-list()
for (i in contrastNames)
{
volcanoMed[[i]] <- rowData(pe[["proteinMedian"]])[[i]] %>%
    ggplot(aes(x = logFC, y = -log10(pval),
           color = adjPval < 0.01)) +
    geom_point(cex = 2.5) +
    scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal() +
    geom_vline(xintercept=log2(FCs[i]),col="red") +
    ggtitle(paste("median summarisation",i))
}
volcanoMed
## $conditionb

## 
## $conditionc

## 
## $conditiond

## 
## $conditione

## 
## $`conditionc - conditionb`

## 
## $`conditiond - conditionb`

## 
## $`conditione - conditionb`

## 
## $`conditiond - conditionc`

## 
## $`conditione - conditionc`

## 
## $`conditione - conditiond`

Note, that less proteins are found to be DE upon median summarisation.

5.1.2 Sensitivity FDP plot

Add accession slot to rowData of proteinMedian assay. (First check if same proteins are in both the protein and proteinMedian assay).

mean(rownames(pe[["proteinMedian"]])==rownames(pe[["protein"]]))
## [1] 1
rowData(pe[["proteinMedian"]])$accession<-rowData(pe[["protein"]])$accession
tprFdpMed<-list()
tprFdpPlotMed <- list()
for (i in contrastNames)
{
tprFdpMed[[i]] <- tprFdp(rowData(pe[["proteinMedian"]])[[i]]$pval,
             rowData(pe[["proteinMedian"]])$accession$ecoli,rowData(pe[["proteinMedian"]])[[i]]$adjPval)

hlp <- rbind(cbind(tprFdps[[i]],method="robustRlm"),
             cbind(tprFdpMed[[i]],method="medianRlm"))
tprFdpPlotMed[[i]] <- hlp %>%
      ggplot(aes(x=fdp,y=tpr,color=method)) +
      geom_path() +
      ggtitle(i)
}
tprFdpPlotMed
## $conditionb

## 
## $conditionc

## 
## $conditiond

## 
## $conditione

## 
## $`conditionc - conditionb`

## 
## $`conditiond - conditionb`

## 
## $`conditione - conditionb`

## 
## $`conditiond - conditionc`

## 
## $`conditione - conditionc`

## 
## $`conditione - conditiond`

The mean summarisation is vastly outperformed by robust summarisaton. This clearly indicates that differences in peptide species have to be accounted for in the summarisation.

5.2 Robust summarisation followed by robust ridge regression

msqrob2 can also be used to adopt parameter estimation using robust ridge regression by setting the argument ‘ridge=TRUE’. The performance of ridge regression generally improves for more complex designs with multiple conditions.

Note, that the parameter names for ridge regression always start with the string “ridge”. So we have to adjust the contrast matrix.

pe <- msqrob(object=pe,i="protein", formula=~condition,modelColumnName="ridge",ridge=TRUE)
Lridge<-L
rownames(Lridge)<-paste0("ridge",rownames(L))
pe <- hypothesisTest(object=pe,i="protein",contrast=Lridge,modelColumn="ridge",resultsColumnNamePrefix ="ridge")
## Warning in sqrt(sapply(models, varContrast, L = contrast)): NaNs produced

5.2.1 Volcano-plots

volcanoRidge<-list()
for (i in contrastNames)
{
volcanoRidge[[i]] <- rowData(pe[["protein"]])[[paste0("ridge",i)]] %>%
    ggplot(aes(x = logFC, y = -log10(pval),
           color = adjPval < 0.01)) +
    geom_point(cex = 2.5) +
    scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal() +
    geom_vline(xintercept=log2(FCs[i]),col="red") +
    ggtitle(paste("robust ridge",i))
}
volcanoRidge
## $conditionb

## 
## $conditionc

## 
## $conditiond

## 
## $conditione

## 
## $`conditionc - conditionb`

## 
## $`conditiond - conditionb`

## 
## $`conditione - conditionb`

## 
## $`conditiond - conditionc`

## 
## $`conditione - conditionc`

## 
## $`conditione - conditiond`

5.2.2 Sensitivity FDP plot

tprFdpRidge<-list()
tprFdpPlotRidge <- list()
for (i in contrastNames)
{
tprFdpRidge[[i]] <- tprFdp(rowData(pe[["protein"]])[[paste0("ridge",i)]]$pval,
             rowData(pe[["protein"]])$accession$ecoli,rowData(pe[["protein"]])[[paste0("ridge",i)]]$adjPval)

hlp <- rbind(cbind(tprFdps[[i]],method="robustRlm"),
             cbind(tprFdpMed[[i]],method="medianRlm"),
             cbind(tprFdpRidge[[i]],method="robustRidge"))
tprFdpPlotRidge[[i]] <- hlp %>%
      ggplot(aes(x=fdp,y=tpr,color=method)) +
      geom_path() +
      ggtitle(i)
}
tprFdpPlotRidge
## $conditionb

## 
## $conditionc

## 
## $conditiond

## 
## $conditione

## 
## $`conditionc - conditionb`

## 
## $`conditiond - conditionb`

## 
## $`conditione - conditionb`

## 
## $`conditiond - conditionc`

## 
## $`conditione - conditionc`

## 
## $`conditione - conditiond`

Ridge regression further improves the performance for allmost all comparisons.

6 Resolving Fit errors

For some proteins the models cannot be fitted because the design matrix is not full rank due to missingness. In our one-way anova design this happens because all protein expression values for one or more conditions are missing. Note, that we have to be very careful with the inference for these proteins because the reference level of factors can even change. Indeed for some proteins, especially spike-in proteins, the protein expression values for the lowest spike-in condition a are missing.

6.1 Fit models for proteins that are missing

6.1.1 Extract assay data for proteins with fitErrors

fitErrors <- which(sapply(rowData(pe[["protein"]])$msqrobModels,getFitMethod)=="fitError")
yFitErrors <- assay(pe[["protein"]])[fitErrors,]

There are 307 proteins with fit errors.

6.1.2 Fit custom models to proteins with fit errors.

We first calculate the prior degrees of freedom and the prior variance so as to calculate the posterior variances.

vars <- sapply(rowData(pe[["protein"]])$msqrobModels,getVar)
dfs <- sapply(rowData(pe[["protein"]])$msqrobModels,getDF)
priorEst<-limma::fitFDist(vars,dfs)

Next, we estimate the models by dropping the levels of the factor for which all observations are missing.

modelsFitErrors<-apply(yFitErrors,1,function(y,data,formula){
# Remove information for which we have missing values
sel<-!is.na(y)
y<-y[sel]
y<-matrix(y,nrow=1)
data<-subset(data,sel)

# Drop unused levels of factors of the data
factorColumns<-which(sapply(data,class)=="factor")
for (j in factorColumns)
data[[j]]<-droplevels(data[[j]])

# Fit the model and adopt Empirical Bayes variance estimation
out<-try(msqrobLm(y,formula,data)[[1]],silent=TRUE)
if  (class(out)=="try-error")
{
        out <- StatModel(type = "fitError",
                          params = list(coefficients = NA,         
                                        vcovUnscaled = NA,
                                        sigma = NA, df.residual = NA, w = NULL),
                          varPosterior = as.numeric(NA),
                          dfPosterior = as.numeric(NA))
} else {
  slot(out,"varPosterior") <- limma:::.squeezeVar(getVar(out),getDF(out),priorEst$scale,priorEst$df2)
  slot(out,"dfPosterior") <-getDF(out)+priorEst$df2
}
return(out)
},formula=~condition,data=colData(pe)
)

6.2 Inference

6.2.1 Inference for proteins with expression values for the reference level condition a

Calculate reference levels for proteins with fit errors

refLevels <- apply(yFitErrors,1,function(y,data){
  levels(droplevels(data$condition[!is.na(y)]))[1]
  },data=colData(pe))

Parameters for some conditions are missing. We therefore remove the parameters from the contrast for which the contrast equals zero to enable inference for models that contain all model parameters involving a particular contrast.

inferenceErrorRefLevelA<-list()
for (j in 1:ncol(L))
{
  contrast <- L[,j]
  contrast[contrast==0] <- NA
  inferenceErrorRefLevelA[[colnames(L)[j]]] <-   topFeatures(modelsFitErrors[refLevels=="a"],na.exclude(contrast),sort=FALSE)
}

6.3 Inference for proteins for which all expression values for reference condition “a” are missing

For these proteins the reference class has changed. We first set inference for all contrasts involving the reference level “a” equal to NA. We do this by setting all coefficients of the contrast equal to NA.

inferenceErrorRefLevelAltered<-list()
for (j in 1:4)
{
  contrast <- L[,j]
  contrast[]<-NA
  inferenceErrorRefLevelAltered[[colnames(L)[j]]] <-   topFeatures(modelsFitErrors[refLevels!="a"],na.exclude(contrast),sort=FALSE)
}

Next, we perform inference for the remaining contrasts and correct for the change in reference level.

protNamesHlp <- names(modelsFitErrors[refLevels!="a"])
for (j in 5:ncol(L))
{
inferenceErrorRefLevelAltered[[colnames(L)[j]]] <-
    sapply(protNamesHlp,function(i,models,contrast,refLevels){
        contrast[contrast==0]<-NA
        if (refLevels[i]!="a")
            contrast[paste0("condition",refLevels[i])] <- NA
        topFeatures(models[i],na.exclude(contrast)) %>% unlist
        },
        contrast=L[,j],
        refLevels=refLevels, models=modelsFitErrors[refLevels!="a"]) %>%
        t %>%
        data.frame
}

6.4 Combine inference for models with and without fitErrors.

Note, that we have to redo the FDR correction. Indeed the FDR is a set property and has to be calculated using all results (default msqrob2 + custom msqrob2 models) for each contrast. We store the results in the rowData of the assay in columns starting with “rlmOpt” followed by the name of the contrast.

for (j in colnames(L))
{
  # Combining the results
  featureTableHlp <-rowData(pe[["protein"]])[[j]]
  featureTableHlp[rownames(inferenceErrorRefLevelA[[j]]),]<-inferenceErrorRefLevelA[[j]]
  featureTableHlp[rownames(inferenceErrorRefLevelAltered[[j]]),]<-inferenceErrorRefLevelAltered[[j]]
  featureTableHlp$adjPval<-p.adjust(featureTableHlp$pval,"BH")

  # Store results in rowData of the protein assay
  rowData(pe[["protein"]])[[paste0("rlmOpt",j)]] <- featureTableHlp
}

6.5 Comparison of number proteins that can be analysed for a particular contrast

nProtDefault <- sapply(colnames(L),
function(j){
  sum(!is.na(rowData(pe[["protein"]])[[j]]$pval))
  })

nProtCustom <- sapply(colnames(L),
function(j){
sum(!is.na(rowData(pe[["protein"]])[[paste0("rlmOpt",j)]]$pval))
})

cbind(default=nProtDefault,custom=nProtCustom,extra=nProtCustom-nProtDefault)
##                         default custom extra
## conditionb                 4403   4453    50
## conditionc                 4403   4444    41
## conditiond                 4403   4450    47
## conditione                 4403   4447    44
## conditionc - conditionb    4403   4502    99
## conditiond - conditionb    4403   4510   107
## conditione - conditionb    4403   4505   102
## conditiond - conditionc    4403   4518   115
## conditione - conditionc    4403   4514   111
## conditione - conditiond    4403   4526   123

We observe that we can retrieve additional results for 41 to 123 proteins.

6.6 Performance

tprFdpRlmOpt<-tprFdpRlmOptPlot<-list()
for (i in contrastNames)
{
tprFdpRlmOpt[[i]] <- tprFdp(rowData(pe[["protein"]])[[paste0("rlmOpt",i)]]$pval %>% unlist,
             rowData(pe[["protein"]])$accession$ecoli,rowData(pe[["protein"]])[[paste0("rlmOpt",i)]]$adjPval %>% unlist)

hlp<-rbind(cbind(tprFdpRlmOpt[[i]],method="error"),
           cbind(tprFdps[[i]],method="noError"))

tprFdpRlmOptPlot[[i]] <- hlp %>%
    ggplot(aes(x=fdp,y=tpr,color=method)) +
    geom_path() +
    geom_vline(xintercept=0.01,lty=2)
    ggtitle(i)
}
tprFdpRlmOptPlot
## $conditionb

## 
## $conditionc

## 
## $conditiond

## 
## $conditione

## 
## $`conditionc - conditionb`

## 
## $`conditiond - conditionb`

## 
## $`conditione - conditionb`

## 
## $`conditiond - conditionc`

## 
## $`conditione - conditionc`

## 
## $`conditione - conditiond`

As expected, the ranking does not improve much for the comparisons involving the low spike-in concentration because the ecoli proteins are mainly missing for these conditions.